Mental Health in Tech Data Analysis

Table of contents

  • Notebook Initialization and Exploratory Data Analysis (EDA)
    • Setting up the Coding Environment
      • Importing Libraries
      • Importing Functions
    • Loading the Data
      • Data Brief
      • Cleaning the Data
    • Data Exploration Overview
      • Preliminary Plan for Data Exploration
      • Basic Exploration
      • Detailed Exploration
        • Respondent Characteristics Analysis
        • Mental Health Questions Response Analysis
        • Relationship Analysis
  • Conclusions
    • Summary
    • Insights
    • Potential Areas for Investigation
    • Recommendations

¶

Notebook Initialization and Exploratory Data Analysis (EDA)

¶

Setting up the Coding Environment

¶

Importing Libraries

In [ ]:
# %pip install python-dotenv sqlite3 pandas numpy scipy statsmodels plotly
In [ ]:
from dotenv import load_dotenv

load_dotenv()

from sqlalchemy import create_engine
import sqlite3
from math import sqrt
import pandas as pd
import numpy as np

from scipy.stats import pointbiserialr, f_oneway, chi2_contingency, spearmanr
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from sklearn.metrics import r2_score

import plotly
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
import plotly.subplots as sp
from plotly.subplots import make_subplots

pio.renderers.default = "notebook"
plotly.offline.init_notebook_mode()
¶

Importing Functions

In [ ]:
from utils.functions import (
    clean_gender_data,
    clean_age_data,
    clean_to_numerical,
    database_filtering,
)

¶

Loading the Data

In [ ]:
cnx = sqlite3.connect("./datasets/mental_health.sqlite")

df = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table'", cnx)

print(df)
       name
0    Answer
1  Question
2    Survey
In [ ]:
answer = pd.read_sql_query("SELECT * FROM Answer", cnx)
question = pd.read_sql_query("SELECT * FROM Question", cnx)
survey = pd.read_sql_query("SELECT * FROM Survey", cnx)

display(answer.head())
display(question.head())
display(survey.head())
AnswerText SurveyID UserID QuestionID
0 37 2014 1 1
1 44 2014 2 1
2 32 2014 3 1
3 31 2014 4 1
4 31 2014 5 1
questiontext questionid
0 What is your age? 1
1 What is your gender? 2
2 What country do you live in? 3
3 If you live in the United States, which state ... 4
4 Are you self-employed? 5
SurveyID Description
0 2014 mental health survey for 2014
1 2016 mental health survey for 2016
2 2017 mental health survey for 2017
3 2018 mental health survey for 2018
4 2019 mental health survey for 2019
¶

Data Brief

There are 3 tables:

  1. Answer - Answer data [AnswerText, SurveyID, UserID, QuestionID]
  2. Question - Question data [questiontext, questionid]
  3. Survey - Survey data [SurveyID, Description]
¶

Cleaning the Data

Checking for duplicates, missing values, outliers

Missing Values and Duplicates Across Columns
JUSTIFICATION FOR PROCESSING:

  • Missing values in a dataset can lead to inaccurate or misleading statistics and machine learning model predictions. They can occur due to various reasons such as data entry errors, failure to collect information, etc. Depending on the nature and extent of these missing values, different strategies can be employed to handle them.
  • Duplicate values in a dataset can occur due to various reasons such as data entry errors, merging of datasets, etc. Duplicates can lead to biased or incorrect results in data analysis. Therefore, it’s important to identify and remove duplicates.
In [ ]:
print("Missing values in Answer table:")
print(answer.isnull().sum())

print("\nMissing values in Question table:")
print(question.isnull().sum())

print("\nMissing values in Survey table:")
print(survey.isnull().sum())

print("\nDuplicates in Answer table:")
print(answer.duplicated().sum())

print("\nDuplicates in Question table:")
print(question.duplicated().sum())

print("\nDuplicates in Survey table:")
print(survey.duplicated().sum())
Missing values in Answer table:
AnswerText    0
SurveyID      0
UserID        0
QuestionID    0
dtype: int64

Missing values in Question table:
questiontext    0
questionid      0
dtype: int64

Missing values in Survey table:
SurveyID       0
Description    0
dtype: int64

Duplicates in Answer table:
0

Duplicates in Question table:
0

Duplicates in Survey table:
0

Chosen Strategy for Organizing Tables:¶

1. Merging Tables:

  • The three tables are merged based on the QuestionID (referred to as questionid in the Question table) and SurveyID values.
  • The rows are sorted in order of generality, starting with the most generic data (SurveyID), followed by QuestionID, and ending with the most detailed data (UserID).
  • Merging the tables streamlines the workflow. Since the tables didn't have many columns initially, merging them can facilitate the use of data for various comparisons.
In [ ]:
merged_table = pd.merge(
    answer, question, left_on="QuestionID", right_on="questionid", how="outer"
)
merged_table.drop("questionid", axis=1, inplace=True)

merged_table = pd.merge(merged_table, survey, on="SurveyID", how="outer")
merged_table.rename(columns={"questiontext": "QuestionText"}, inplace=True)

merged_table["SurveyID"] = merged_table["SurveyID"].astype(int)
merged_table["QuestionID"] = merged_table["QuestionID"].astype(int)
merged_table["UserID"] = merged_table["UserID"].astype(int)

merged_table.sort_values(by=["SurveyID", "QuestionID", "UserID"], inplace=True)
merged_table = merged_table.reset_index(drop=True)

2. Preparing the data for display:

  • Mapping the gender values to fit into one of 4 categories - Male, Female, LGBTQIA+, Unknown
  • Converting the values relating to age (QuestionID = 1) to integer and removing the values that are lower than 18 and higher than 70 - most probably a mistaken ones
  • Converting the values in questions no. 7, 12, 28, 98, 56, 102 to numbers.
In [ ]:
processed_dataset = merged_table.copy()

processed_dataset.loc[
    processed_dataset["QuestionID"] == 2, "AnswerText"
] = clean_gender_data(
    processed_dataset[processed_dataset["QuestionID"] == 2], "AnswerText"
)
processed_dataset.loc[
    processed_dataset["QuestionID"] == 1, "AnswerText"
] = clean_age_data(
    processed_dataset[processed_dataset["QuestionID"] == 1], "AnswerText"
)
processed_dataset.loc[
    processed_dataset["QuestionID"] == 7, "AnswerText"
] = clean_to_numerical(
    processed_dataset[processed_dataset["QuestionID"] == 7], "AnswerText"
)
processed_dataset.loc[
    processed_dataset["QuestionID"] == 12, "AnswerText"
] = clean_to_numerical(
    processed_dataset[processed_dataset["QuestionID"] == 12], "AnswerText"
)
processed_dataset.loc[
    processed_dataset["QuestionID"].isin([28, 98]), "AnswerText"
] = clean_to_numerical(
    processed_dataset[processed_dataset["QuestionID"].isin([28, 98])], "AnswerText"
)
processed_dataset.loc[
    processed_dataset["QuestionID"].isin([56, 102]), "AnswerText"
] = clean_to_numerical(
    processed_dataset[processed_dataset["QuestionID"].isin([56, 102])], "AnswerText"
)

processed_dataset.head()
c:\Users\aga\Documents\GitHub\athiel-DA.1.5\utils\functions.py:60: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Out[ ]:
AnswerText SurveyID UserID QuestionID QuestionText Description
0 37 2014 1 1 What is your age? mental health survey for 2014
1 44 2014 2 1 What is your age? mental health survey for 2014
2 32 2014 3 1 What is your age? mental health survey for 2014
3 31 2014 4 1 What is your age? mental health survey for 2014
4 31 2014 5 1 What is your age? mental health survey for 2014

3. Prepare the new file for clean DataFrame

  • Export processed_dataset to .sqlite
  • Form a connection to database
In [ ]:
engine = create_engine("sqlite:///./datasets/mental_health_clean.sqlite")

processed_dataset.to_sql("Answer", engine, if_exists="replace", index=False)

engine = create_engine("sqlite:///./datasets/mental_health_clean.sqlite")
In [ ]:
cnx2 = sqlite3.connect("./datasets/mental_health_clean.sqlite")

OUTCOMES:

  • Neither missing values nor duplicates were found in the datasets.
  • Due to the small number of columns and relatively small number of rows in the datasets, the decision was made to reorganize the data into one table. This could potentially facilitate analysis in further steps.

Outliers in Numeric Columns
JUSTIFICATION FOR PROCESSING: It is hypothesized that the presence of outliers in the dataset may significantly impact the distribution of key variables. Outliers are expected to exhibit values that deviate substantially from the typical range, potentially influencing summary statistics and visual representations. The analysis aims to identify and understand the nature of outliers, exploring their potential impact on data integrity and the overall statistical profile.

In [ ]:
prism_palette = px.colors.qualitative.Prism

fig_question = px.box(
    processed_dataset,
    x="SurveyID",
    y="QuestionID",
    color="SurveyID",
    title="Box Plot of QuestionID Across Surveys",
    labels={"SurveyID": "Survey Year", "QuestionID": "Question ID"},
    color_discrete_sequence=prism_palette,
)

fig_user = px.box(
    processed_dataset,
    x="SurveyID",
    y="UserID",
    color="SurveyID",
    title="Box Plot of UserID Across Surveys",
    labels={"SurveyID": "Survey Year", "UserID": "User ID"},
    color_discrete_sequence=prism_palette,
)

fig_question.show()
fig_user.show()

OUTCOMES:

  • The rows identified as outliers were not found in a dataset.
  1. Number of Surveys: Mental health surveys were conducted annually from 2014 to 2019, with the exception of 2015.
  2. Number of Questions: The number of questions varied over the years. The most, 118, were asked in 2016, followed by 103 in 2014. From 2017 onwards, the questionnaire consistently contained 89 questions.
  3. Number of Participants: The survey in 2016 had the highest participation, with 1,432 respondents. This was followed by 1,260 respondents in 2014. Since 2016, the number of respondents has been on a downward trend, with counts of 755, 416, and 351 for the years 2017, 2018, and 2019, respectively.


¶

Data Exploration Overview

Scrutinizing the dataset to identify key patterns, relationships, and trends. This process aids in detecting significant variables and anomalies, leading to more accurate predictions and insights.

¶

Preliminary Plan for Data Exploration

Basic Exploration

  1. Utilize the describe function to provide an overview of numerical and categorical features in each dataset.

Detailed Exploration

  1. Respondent Characteristics Analysis:

    • Age Distribution
    • Gender Distribution
    • Country Distribution
    • US State Distribution
  2. Mental Health Questions Response Analysis:

    • Treatment: Analysis of responses to 'Seeking Mental Health Treatment in the Past'
    • Interview: Analysis of responses to 'Bringing Up Mental Health Issue Over Interview'
    • Discussion: Analysis of responses to 'Fear of Negative Consequences in Mental Health Discussions'
    • Handling: Analysis of responses to 'Witnessed Poor Handling of Mental Health Issues at Work'
  3. Relationship Analysis:

    • Age - Treatment Relationship
    • Age - Interview Relationship
    • Age - Discussion Relationship
    • Age - Handling Relationship
    • Gender - Treatment Relationship
    • Gender - Interview Relationship
    • Gender - Discussion Relationship
    • Gender - Handling Relationship
¶

Basic Exploration

In [ ]:
print("Processed Dataset")
print(processed_dataset.describe(include=["object"]).T)
Processed Dataset
               count unique                            top   freq
AnswerText    236898   4118                             -1  56110
QuestionText  236898    105              What is your age?   4218
Description   236898      5  mental health survey for 2016  88238

OUTCOMES:

  • AnswerText: There are 236,898 rows in total, and "-1" appears 56,598 times.

  • QuestionText: There are 236,898 rows in total, and the most frequent question is "What is your age?" which appears 4,218 times.

  • Description: There are 236,898 rows in total, and the most frequent description is "mental health survey for 2016", which appears 88,238 times.
These are preliminary observations based on the provided summary statistics. For a more detailed analysis, a deeper examination of the actual data, including visualizations, will be conducted. Since these are assumptions, the actual interpretation of the data may vary depending on the context and the source of the data. These factors will be further assessed during the Exploratory Data Analysis (EDA) stage.

¶

Detailed Exploration

¶
Respondent Characteristics Analysis

HYPOTHESIS: In the field of IT, professionals may exhibit significant differences in various aspects such as job roles, experience levels, and backgrounds. This analysis hypothesizes that exploring these differences can enhance our understanding of their diverse responses to the survey topics. The validity of this hypothesis will be tested in the subsequent analysis.

Age Distribution

In [ ]:
query_age_distribution_tech = """
    SELECT SurveyID,
        CASE 
            WHEN AnswerText BETWEEN 18 AND 24 THEN '18-24'
            WHEN AnswerText BETWEEN 25 AND 34 THEN '25-34'
            WHEN AnswerText BETWEEN 35 AND 44 THEN '35-44'
            WHEN AnswerText BETWEEN 45 AND 54 THEN '45-54'
            WHEN AnswerText BETWEEN 55 AND 64 THEN '55-64'
            WHEN AnswerText BETWEEN 65 AND 99 THEN 'Over 65 '
        END AS age_group,  
        COUNT(*) AS count
    FROM Answer
    WHERE SurveyID IN (2014, 2016, 2017, 2018, 2019)
        AND (QuestionID = 1 OR QuestionID IN (9, 13))  -- Include Question 1, 9, and 13
        AND (AnswerText > 18 OR AnswerText IN (-1, 1))  -- Include -1 and 1 for both questions
    GROUP BY SurveyID, age_group
    ORDER BY SurveyID, age_group; 
"""

age_survey_distribution_tech = pd.read_sql(query_age_distribution_tech, cnx2)

max_y_value = 800

fig = px.line(
    age_survey_distribution_tech,
    x="age_group",
    y="count",
    color="SurveyID",
    labels={"age_group": "Age Group", "count": "Count", "SurveyID": "Survey Year"},
    title="Count of Users by Age Group and Tech Employment (Questions 1, 9, and 13)",
    category_orders={
        "age_group": ["18-24", "25-34", "35-44", "45-54", "55-64", "Over 65 "]
    },
    color_discrete_sequence=prism_palette,
)

fig.update_yaxes(range=[0, max_y_value])

fig.show()

Gender Distribution

In [ ]:
query_gender_distribution = """
    SELECT SurveyID, 
        SUM(CASE WHEN AnswerText = 'Male' THEN 1 ELSE 0 END) AS male_count,
        SUM(CASE WHEN AnswerText = 'Female' THEN 1 ELSE 0 END) AS female_count,
        SUM(CASE WHEN AnswerText = 'LGBTQIA+' THEN 1 ELSE 0 END) AS lgbtqia_count,
        SUM(CASE WHEN AnswerText = 'Unknown' THEN 1 ELSE 0 END) AS unknown_count
    FROM Answer
    WHERE QuestionID == 2 OR (QuestionID == 9 AND AnswerText == 1) OR (QuestionID == 13 AND AnswerText == 1)
    GROUP BY SurveyID
"""

df = pd.read_sql(query_gender_distribution, cnx2)

prism_colors = [
    "#5f4690",
    "#1d6a96",
    "#38a6a4",
    "#0f8554",
    "#73af48",
    "#edac08",
    "#e17a05",
    "#cc513e",
    "#94346e",
    "#6f4070",
    "#666666",
]

fig = px.bar(
    df,
    x="SurveyID",
    y=["male_count", "female_count", "lgbtqia_count", "unknown_count"],
    labels={"value": "Count", "variable": "Gender"},
    title="Gender Distribution in Tech Surveys",
    color_discrete_sequence=prism_colors,
)

fig.update_layout(
    barmode="group", xaxis_title="Survey ID", yaxis_title="Count", legend_title="Gender"
)

fig.show()

Country Distribution

In [ ]:
query_country_distribution = """
    SELECT SurveyID, AnswerText AS Country, COUNT(*) AS Employee_count
    FROM Answer
    WHERE QuestionID = 50
    GROUP BY SurveyID, Country
    ORDER BY SurveyID, Employee_count DESC
"""

df = pd.read_sql(query_country_distribution, cnx2)

df["Employee_count"] = df["Employee_count"].astype(int)

top_countries_by_survey = (
    df.groupby("SurveyID")
    .apply(lambda x: x.nlargest(10, "Employee_count"))
    .reset_index(drop=True)
)

fig = sp.make_subplots(
    rows=1,
    cols=1,
    subplot_titles=["Top 10 Country of Work Distribution in Tech Surveys"],
)

buttons = [
    dict(
        label="All",
        method="update",
        args=[{"visible": [True] * len(top_countries_by_survey)}],
    )
]

for survey_id in top_countries_by_survey["SurveyID"].unique():
    button = dict(
        label=str(survey_id),
        method="update",
        args=[
            {
                "visible": [
                    survey_id == sid
                    for sid in top_countries_by_survey["SurveyID"].unique()
                ]
            }
        ],
    )
    buttons.append(button)

fig.update_layout(
    updatemenus=[
        dict(
            type="buttons",
            showactive=True,
            buttons=buttons,
            x=0.1,
            xanchor="left",
            y=1.15,
            yanchor="top",
        )
    ]
)

for survey_id, color in zip(top_countries_by_survey["SurveyID"].unique(), prism_colors):
    data = top_countries_by_survey[top_countries_by_survey["SurveyID"] == survey_id]
    trace = go.Bar(
        x=data["Country"],
        y=data["Employee_count"],
        name=f"Survey {survey_id}",
        marker_color=color,
    )
    fig.add_trace(trace)

fig.update_layout(
    xaxis_title="Country",
    yaxis_title="Number of Employees",
    title_x=0.5,
)

fig.show()
In [ ]:
df["Employee_count"] = df["Employee_count"].astype(int)
total_respondents = df["Employee_count"].sum()

top_countries = df.groupby("Country").sum().nlargest(10, "Employee_count")

for country, count in top_countries.iterrows():
    percentage = (count["Employee_count"] / total_respondents) * 100
    print(
        f"{country}: {count['Employee_count']} respondents ({percentage:.2f}% of total)"
    )
United States of America: 1872 respondents (63.29% of total)
United Kingdom: 300 respondents (10.14% of total)
Canada: 122 respondents (4.12% of total)
Germany: 90 respondents (3.04% of total)
Netherlands: 70 respondents (2.37% of total)
Australia: 50 respondents (1.69% of total)
India: 39 respondents (1.32% of total)
France: 34 respondents (1.15% of total)
Brazil: 30 respondents (1.01% of total)
Spain: 24 respondents (0.81% of total)
In [ ]:
query_users_participated = """
    SELECT UserID, AnswerText AS Country
    FROM Answer
    WHERE QuestionID = 50
"""

df = pd.read_sql(query_users_participated, cnx2)

total_unique_respondents = df["UserID"].nunique()

df_countries = df.groupby("Country").size().reset_index(name="Employee_count")
top_countries = df_countries.nlargest(4, "Employee_count")

for country, count in top_countries.iterrows():
    percentage = (count["Employee_count"] / total_unique_respondents) * 100
    print(
        f"{country}: {count['Employee_count']} respondents ({percentage:.2f}% of total unique respondents)"
    )

print(f"\nTotal unique respondents: {total_unique_respondents}")
73: 1872 respondents (63.29% of total unique respondents)
72: 300 respondents (10.14% of total unique respondents)
13: 122 respondents (4.12% of total unique respondents)
28: 90 respondents (3.04% of total unique respondents)

Total unique respondents: 2958

State Distribution

In [ ]:
query_state_distribution = """
SELECT
    year,
    CASE WHEN state = -1 THEN 'Unknown' ELSE state END as state,
    COUNT(DISTINCT UserID) AS Employee_count,
    SurveyID
FROM (
    SELECT 
        UserID, 
        SurveyID AS year,
        MAX(CASE WHEN QuestionID == 4 AND AnswerText != -1 THEN AnswerText END) AS state,
        MAX(CASE WHEN QuestionID == 9 THEN AnswerText END) AS Tech_company,
        MAX(CASE WHEN QuestionID == 13 THEN AnswerText END) AS Tech_role,
        SurveyID
    FROM Answer
    WHERE   
        (QuestionID == 4 AND AnswerText != -1)
        OR (QuestionID == 9 AND AnswerText == 1)
        OR (QuestionID == 13 AND AnswerText == 1)
    GROUP BY UserID, SurveyID
    HAVING state IS NOT NULL AND (Tech_company IS NOT NULL OR Tech_role IS NOT NULL) 
)
GROUP BY year, state, SurveyID
ORDER BY year, Employee_count DESC;
"""

df = pd.read_sql(query_state_distribution, cnx2)

df["Employee_count"] = df["Employee_count"].astype(int)

top_states_by_survey = (
    df.groupby("SurveyID")
    .apply(lambda x: x.nlargest(10, "Employee_count"))
    .reset_index(drop=True)
)

fig = sp.make_subplots(
    rows=1,
    cols=1,
    subplot_titles=["Top 10 US States of Work Distribution in Tech Surveys"],
)

buttons = [
    dict(
        label="All",
        method="update",
        args=[{"visible": [True] * len(top_states_by_survey)}],
    )
]

for survey_id in top_states_by_survey["SurveyID"].unique():
    button = dict(
        label=str(survey_id),
        method="update",
        args=[
            {
                "visible": [
                    survey_id == sid
                    for sid in top_states_by_survey["SurveyID"].unique()
                ]
            }
        ],
    )
    buttons.append(button)

fig.update_layout(
    updatemenus=[
        dict(
            type="buttons",
            showactive=True,
            buttons=buttons,
            x=0.1,
            xanchor="left",
            y=1.15,
            yanchor="top",
        )
    ]
)

for survey_id, color in zip(top_states_by_survey["SurveyID"].unique(), prism_colors):
    data = top_states_by_survey[top_states_by_survey["SurveyID"] == survey_id]
    trace = go.Bar(
        x=data["state"],
        y=data["Employee_count"],
        name=f"Survey {survey_id}",
        marker_color=color,
    )
    fig.add_trace(trace)

fig.update_layout(
    xaxis_title="State",
    yaxis_title="Number of Employees",
    title_x=0.5,
)

fig.show()
In [ ]:
df = pd.read_sql(query_state_distribution, cnx2)

df = df[df["state"].notna()]
df["Employee_count"] = df["Employee_count"].astype(int)
total_respondents = df["Employee_count"].sum()

top_states = df.groupby("state").sum().nlargest(10, "Employee_count")

for state, count in top_states.iterrows():
    percentage = (count["Employee_count"] / total_respondents) * 100
    print(
        f"{state}: {count['Employee_count']} respondents ({percentage:.2f}% of total)"
    )
California: 336 respondents (15.22% of total)
Illinois: 187 respondents (8.47% of total)
Washington: 150 respondents (6.79% of total)
New York: 123 respondents (5.57% of total)
Tennessee: 105 respondents (4.76% of total)
Texas: 98 respondents (4.44% of total)
Michigan: 96 respondents (4.35% of total)
Ohio: 95 respondents (4.30% of total)
Oregon: 84 respondents (3.80% of total)
Minnesota: 82 respondents (3.71% of total)

OUTCOMES:

  • On average between 42.2 to 50% of people taking part in surveys over the years are in the 25-34 age range. Another group is made up of people aged 35-44 (30.84 - 32.29%). Labour force participation in IT in the others is significantly lower.

  • Between 64.2 - 79 % of all respondents in a given year are men, the next group is made up of women 19.68 - 30 % , and the last and smallest groups are those who have reserved the provision of these data or identify themselves as being from the LGBTQIA+ community.

  • The survey garnered responses from a diverse set of countries. Notably, the United States led with 1,872 respondents (63.29% of total participants), followed by the United Kingdom (300 respondents, 10.14%), Canada (122 respondents, 4.12%), Germany (90 respondents, 3.04%), the Netherlands (70 respondents, 2.37%), Australia (50 respondents, 1.69%), and others. In total, the survey attracted 2958 respondents.

  • The surveys reflect a diverse participation from various US states. Notably, California led with 315 respondents, constituting 17.29% of the total participants, followed by Illinois (152 respondents, 8.34%), Washington (137 respondents, 7.52%), New York (105 respondents, 5.76%), Michigan (83 respondents, 4.56%), Texas (79 respondents, 4.34%), Oregon (76 respondents, 4.17%), Tennessee (76 respondents, 4.17%), Ohio (67 respondents, 3.68%), and Minnesota (65 respondents, 3.57%).

INSIGHTS:

  • The majority of survey participants fall within the 25-44 age range, with 42.2% to 50% in the 25-34 bracket. Other age groups exhibit lower representation, highlighting a concentration in the mid-career demographic.

  • Gender distribution indicates a consistent trend, with men comprising 64.2% to 79% of respondents. Women make up 19.68% to 30%, while the smallest groups include those withholding gender information or identifying as LGBTQIA+.

  • The survey draws a diverse global audience, led by the United States (63.29% of total participants). Significant contributions also come from the United Kingdom (10.14%), Canada (4.12%), Germany (3.04%), and others, totaling 2958 respondents.

  • Within the United States, California stands out as a major participant, constituting 17.29% of total respondents. Notable contributions also come from Illinois (8.34%), Washington (7.52%), and other states, offering valuable regional perspectives in the tech industry.

CAUTION: The data reveals notable disproportions across the IT landscape, raising concerns about underrepresentation in certain groups. Countries and states exhibit uneven participation, with some regions lacking representation while others dominate, introducing potential biases in the analysis of respondents' answers. Recognizing these limitations, the decision has been made to focus testing on gender and age data, where the biases are assumed to be less pronounced compared to countries and states with more diverse and underrepresented groups.

¶

Mental Health Questions Response Analysis
HYPOTHESIS: The tech industry’s complex attitudes and practices towards mental health may be influenced by stigma, potential professional repercussions, and observed mishandling of such issues at work.

Seeking Mental Health Treatment in the Past

In [ ]:
fig = go.Figure()

query_treatment = """
SELECT
    Treatment,
    SUM(CASE year WHEN 2014 THEN percentage END) AS Y2014,
    SUM(CASE year WHEN 2016 THEN percentage END) AS Y2016,
    SUM(CASE year WHEN 2017 THEN percentage END) AS Y2017,
    SUM(CASE year WHEN 2018 THEN percentage END) AS Y2018,
    SUM(CASE year WHEN 2019 THEN percentage END) AS Y2019
FROM (
    SELECT year, Treatment,
        COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY year) AS percentage
    FROM (
        SELECT SurveyID AS year, 
            MAX(CASE WHEN (QuestionID == 13 AND AnswerText == 1) THEN AnswerText END) AS Tech_company,
            MAX(CASE WHEN (QuestionID == 9 AND AnswerText == 1) THEN AnswerText END) AS Tech_role,
            MAX(CASE WHEN (QuestionID == 7) THEN AnswerText END) AS Treatment,
            COUNT (*) as count
        FROM Answer
        WHERE QuestionID IN (7, 9, 13)
        GROUP BY UserID, SurveyID
        HAVING (Tech_company IS NOT NULL OR Tech_role IS NOT NULL)
    ) AS t
    GROUP BY year, Treatment
) AS t2
GROUP BY Treatment
"""

df = pd.read_sql(query_treatment, cnx2)

df["Treatment"] = pd.to_numeric(df["Treatment"])

labels_dict = {1: "Yes", 0.5: "Maybe", 0: "No", -1: "Unknown"}

Treatment_colors = {
    "Yes": "#5f4690",
    "Maybe": "#38a6a4",
    "No": "#1d6a96",
    "Unknown": "#0f8554",
}

colors = [Treatment_colors[labels_dict[x]] for x in df["Treatment"].dropna()]

for year in df.columns[1:]:
    data = df[["Treatment", year]].dropna()
    colors = [Treatment_colors[labels_dict[x]] for x in data["Treatment"]]
    fig.add_trace(
        go.Pie(
            labels=[labels_dict[x] for x in data["Treatment"]],
            values=data[year],
            hole=0.5,
            name=str(year),
            marker=dict(colors=colors, line=dict(color="#000000", width=2)),
        )
    )
fig.update_layout(
    title="Seeking Mental Health Treatment in the Past",
    updatemenus=[
        go.layout.Updatemenu(
            active=0,
            buttons=list(
                [
                    dict(
                        label="All",
                        method="update",
                        args=[
                            {"visible": [True] * len(df.columns[1:])},
                            {
                                "title": "Seeking Mental Health Treatment in the Past - All Years",
                                "showactive": True,
                            },
                        ],
                    ),
                    dict(
                        label="Y2014",
                        method="update",
                        args=[
                            {"visible": [True, False, False, False, False]},
                            {
                                "title": "Seeking Mental Health Treatment in the Past in 2014",
                                "showactive": True,
                            },
                        ],
                    ),
                    dict(
                        label="Y2016",
                        method="update",
                        args=[
                            {"visible": [False, True, False, False, False]},
                            {
                                "title": "Seeking Mental Health Treatment in the Past in 2016",
                                "showactive": True,
                            },
                        ],
                    ),
                    dict(
                        label="Y2017",
                        method="update",
                        args=[
                            {"visible": [False, False, True, False, False]},
                            {
                                "title": "Seeking Mental Health Treatment in the Past in 2017",
                                "showactive": True,
                            },
                        ],
                    ),
                    dict(
                        label="Y2018",
                        method="update",
                        args=[
                            {"visible": [False, False, False, True, False]},
                            {
                                "title": "Seeking Mental Health Treatment in the Past in 2018",
                                "showactive": True,
                            },
                        ],
                    ),
                    dict(
                        label="Y2019",
                        method="update",
                        args=[
                            {"visible": [False, False, False, False, True]},
                            {
                                "title": "Seeking Mental Health Treatment in the Past in 2019",
                                "showactive": True,
                            },
                        ],
                    ),
                ]
            ),
        )
    ],
)

fig.update_layout(
    legend=dict(
        title="Treatment",
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1,
        traceorder="normal",
        itemsizing="constant",
        itemwidth=60,
        bordercolor="#444",
        borderwidth=1,
    )
)

fig.show()

Bringing Up Mental Health Issue Over Interview

In [ ]:
fig = go.Figure()

query_interview = """
SELECT
    Interview,
    SUM(CASE year WHEN 2014 THEN percentage END) AS Y2014,
    SUM(CASE year WHEN 2016 THEN percentage END) AS Y2016,
    SUM(CASE year WHEN 2017 THEN percentage END) AS Y2017,
    SUM(CASE year WHEN 2018 THEN percentage END) AS Y2018,
    SUM(CASE year WHEN 2019 THEN percentage END) AS Y2019
FROM (
    SELECT year, Interview,
        COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY year) AS percentage
    FROM (
        SELECT SurveyID AS year, 
        MAX(CASE WHEN (QuestionID == 13 AND AnswerText == 1) THEN AnswerText END) AS Tech_company,
        MAX(CASE WHEN (QuestionID == 9 AND AnswerText == 1) THEN AnswerText END) AS Tech_role,
        MAX(CASE WHEN (QuestionID == 12) THEN AnswerText END) AS Interview,
        COUNT (*) as count
        FROM Answer
        WHERE QuestionID IN (9, 12, 13)
        GROUP BY UserID, SurveyID
        HAVING (Tech_company IS NOT NULL OR Tech_role IS NOT NULL) 
    ) AS t
    GROUP BY year, Interview
) AS t2
GROUP BY Interview
"""

df = pd.read_sql(query_interview, cnx2)

df["Interview"] = pd.to_numeric(df["Interview"])

labels_dict = {1.0: "Yes", 0.5: "Maybe", 0.0: "No", -1.0: "Unknown"}

Interview_colors = {
    "Yes": "#5f4690",
    "No": "#1d6a96",
    "Maybe": "#38a6a4",
    "Unknown": "#0f8554",
}

colors = [Interview_colors[labels_dict[x]] for x in df["Interview"].dropna()]

for year in df.columns[1:]:
    data = df[["Interview", year]].dropna()
    colors = [Interview_colors[labels_dict[x]] for x in data["Interview"]]
    fig.add_trace(
        go.Pie(
            labels=[labels_dict[x] for x in data["Interview"]],
            values=data[year],
            hole=0.5,
            name=str(year),
            marker=dict(colors=colors, line=dict(color="#000000", width=2)),
        )
    )

fig.update_layout(
    title="Bringing Up Mental Health Issue Over Interview",
    updatemenus=[
        go.layout.Updatemenu(
            active=0,
            buttons=list(
                [
                    dict(
                        label="All",
                        method="update",
                        args=[
                            {"visible": [True] * len(df.columns[1:])},
                            {
                                "title": "Bringing Up Mental Health Issue Over Interview - All Years",
                                "showactive": True,
                            },
                        ],
                    ),
                    dict(
                        label="Y2014",
                        method="update",
                        args=[
                            {"visible": [True, False, False, False, False]},
                            {
                                "title": "Bringing Up Mental Health Issue Over Interview in 2014",
                                "showactive": True,
                            },
                        ],
                    ),
                    dict(
                        label="Y2016",
                        method="update",
                        args=[
                            {"visible": [False, True, False, False, False]},
                            {
                                "title": "Bringing Up Mental Health Issue Over Interview in 2016",
                                "showactive": True,
                            },
                        ],
                    ),
                    dict(
                        label="Y2017",
                        method="update",
                        args=[
                            {"visible": [False, False, True, False, False]},
                            {
                                "title": "Bringing Up Mental Health Issue Over Interview in 2017",
                                "showactive": True,
                            },
                        ],
                    ),
                    dict(
                        label="Y2018",
                        method="update",
                        args=[
                            {"visible": [False, False, False, True, False]},
                            {
                                "title": "Bringing Up Mental Health Issue Over Interview in 2018",
                                "showactive": True,
                            },
                        ],
                    ),
                    dict(
                        label="Y2019",
                        method="update",
                        args=[
                            {"visible": [False, False, False, False, True]},
                            {
                                "title": "Bringing Up Mental Health Issue Over Interview in 2019",
                                "showactive": True,
                            },
                        ],
                    ),
                ]
            ),
        )
    ],
)

fig.update_layout(
    legend=dict(
        title="Interview",
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1,
        traceorder="normal",
        itemsizing="constant",
        itemwidth=60,
        bordercolor="#444",
        borderwidth=1,
    )
)

fig.show()

Fear of Negative Consequences in Mental Health Discussions

In [ ]:
fig = go.Figure()

query_discussing = """
SELECT
    MAX(Discussing) AS Discussing,
    SUM(CASE year WHEN 2014 THEN percentage END) AS Y2014,
    SUM(CASE year WHEN 2016 THEN percentage END) AS Y2016,
    SUM(CASE year WHEN 2017 THEN percentage END) AS Y2017,
    SUM(CASE year WHEN 2018 THEN percentage END) AS Y2018,
    SUM(CASE year WHEN 2019 THEN percentage END) AS Y2019
FROM (
    SELECT year, MAX(Discussing) AS Discussing,
        COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY year) AS percentage
    FROM (
        SELECT SurveyID AS year, 
        MAX(CASE WHEN (QuestionID == 13 AND AnswerText == 1) THEN AnswerText END) AS Tech_company,
        MAX(CASE WHEN (QuestionID == 9 AND AnswerText == 1) THEN AnswerText END) AS Tech_role,
        MAX(CASE WHEN (QuestionID == 98 OR QuestionID == 28) THEN AnswerText END) AS Discussing,
        COUNT (*) as count
        FROM Answer
        WHERE QuestionID IN (9, 13, 98, 28)
        GROUP BY UserID, SurveyID
        HAVING (Tech_company IS NOT NULL OR Tech_role IS NOT NULL) 
    ) AS t
    GROUP BY year, Discussing
) AS t2
GROUP BY Discussing
"""

df = pd.read_sql(query_discussing, cnx2)
df["Discussing"] = clean_to_numerical(df, "Discussing")

df["Discussing"] = pd.to_numeric(df["Discussing"], errors="coerce")
df_grouped = df.groupby("Discussing").sum().reset_index()
df_grouped = df_grouped.dropna(subset=["Discussing"])

display(df_grouped)

labels_dict = {1.0: "Yes", 0.5: "Maybe", 0.0: "No", -1.0: "Unknown"}

Discussing_colors = {
    "Yes": "#5f4690",
    "No": "#1d6a96",
    "Maybe": "#38a6a4",
    "Unknown": "#0f8554",
}

colors = [Discussing_colors[labels_dict[x]] for x in df["Discussing"].dropna()]

for year in df_grouped.columns[1:]:
    data = df_grouped[["Discussing", year]].dropna()
    colors = [Discussing_colors[labels_dict[x]] for x in data["Discussing"]]
    fig.add_trace(
        go.Pie(
            labels=[labels_dict[x] for x in data["Discussing"]],
            values=data[year],
            hole=0.5,
            name=str(year),
            marker=dict(colors=colors, line=dict(color="#000000", width=2)),
        )
    )

fig.update_layout(
    title="Fear of Negative Consequences in Mental Health Discussions",
    updatemenus=[
        go.layout.Updatemenu(
            active=0,
            buttons=list(
                [
                    dict(
                        label="All",
                        method="update",
                        args=[
                            {"visible": [True] * len(df.columns[1:])},
                            {
                                "title": "Fear of Negative Consequences in Mental Health Discussions - All Years",
                                "showactive": True,
                            },
                        ],
                    ),
                    dict(
                        label="Y2014",
                        method="update",
                        args=[
                            {"visible": [True, False, False, False, False]},
                            {
                                "title": "Fear of Negative Consequences in Mental Health Discussions in 2014",
                                "showactive": True,
                            },
                        ],
                    ),
                    dict(
                        label="Y2016",
                        method="update",
                        args=[
                            {"visible": [False, True, False, False, False]},
                            {
                                "title": "Fear of Negative Consequences in Mental Health Discussions in 2016",
                                "showactive": True,
                            },
                        ],
                    ),
                    dict(
                        label="Y2017",
                        method="update",
                        args=[
                            {"visible": [False, False, True, False, False]},
                            {
                                "title": "Fear of Negative Consequences in Mental Health Discussions in 2017",
                                "showactive": True,
                            },
                        ],
                    ),
                    dict(
                        label="Y2018",
                        method="update",
                        args=[
                            {"visible": [False, False, False, True, False]},
                            {
                                "title": "Fear of Negative Consequences in Mental Health Discussions in 2018",
                                "showactive": True,
                            },
                        ],
                    ),
                    dict(
                        label="Y2019",
                        method="update",
                        args=[
                            {"visible": [False, False, False, False, True]},
                            {
                                "title": "Fear of Negative Consequences in Mental Health Discussions in 2019",
                                "showactive": True,
                            },
                        ],
                    ),
                ]
            ),
        )
    ],
)

fig.update_layout(
    legend=dict(
        title="Discussing",
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1,
        traceorder="normal",
        itemsizing="constant",
        itemwidth=60,
        bordercolor="#444",
        borderwidth=1,
    )
)

fig.show()
Discussing Y2014 Y2016 Y2017 Y2018 Y2019
0 -1.0 0.000000 17.860301 17.610063 20.448179 21.694915
1 0.0 41.707081 6.896552 6.446541 6.442577 4.406780
2 0.5 36.469447 45.092838 44.025157 43.137255 41.016949
3 1.0 21.823472 30.150309 31.918239 29.971989 32.881356

Witnessed Poor Handling of Mental Health Issues at Work

In [ ]:
fig = go.Figure()

query_handling = """
SELECT
    Handling,
    SUM(CASE year WHEN 2014 THEN percentage END) AS Y2014,
    SUM(CASE year WHEN 2016 THEN percentage END) AS Y2016,
    SUM(CASE year WHEN 2017 THEN percentage END) AS Y2017,
    SUM(CASE year WHEN 2018 THEN percentage END) AS Y2018,
    SUM(CASE year WHEN 2019 THEN percentage END) AS Y2019
FROM (
    SELECT year, Handling,
        COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY year) AS percentage
    FROM (
        SELECT SurveyID AS year, 
        MAX(CASE WHEN (QuestionID == 13 AND AnswerText == 1) THEN AnswerText END) AS Tech_company,
        MAX(CASE WHEN (QuestionID == 9 AND AnswerText == 1) THEN AnswerText END) AS Tech_role,
        MAX(CASE WHEN (QuestionID == 56 OR QuestionID == 102) THEN AnswerText END) AS Handling,
        COUNT (*) as count
        FROM Answer
        WHERE QuestionID IN (9, 13, 56, 102)
        GROUP BY UserID, SurveyID
        HAVING (Tech_company IS NOT NULL OR Tech_role IS NOT NULL) 
    ) AS t
    GROUP BY year, Handling
) AS t2
GROUP BY Handling
"""

df = pd.read_sql(query_handling, cnx2)

df["Handling"] = pd.to_numeric(df["Handling"])

labels_dict = {1.0: "Yes", 0.5: "Maybe", 0.0: "No", -1.0: "Unknown"}

Handling_colors = {
    "Yes": "#5f4690",
    "No": "#1d6a96",
    "Maybe": "#38a6a4",
    "Unknown": "#0f8554",
}

colors = [Handling_colors[labels_dict[x]] for x in df["Handling"].dropna()]

for year in df.columns[1:]:
    data = df[["Handling", year]].dropna()
    colors = [Handling_colors[labels_dict[x]] for x in data["Handling"]]
    fig.add_trace(
        go.Pie(
            labels=[labels_dict[x] for x in data["Handling"]],
            values=data[year],
            hole=0.5,
            name=str(year),
            marker=dict(colors=colors, line=dict(color="#000000", width=2)),
        )
    )

fig.update_layout(
    title="Witnessed Poor Handling of Mental Health Issues at Work",
    updatemenus=[
        go.layout.Updatemenu(
            active=0,
            buttons=list(
                [
                    dict(
                        label="All",
                        method="update",
                        args=[
                            {"visible": [True] * len(df.columns[1:])},
                            {
                                "title": "Witnessed Poor Handling of Mental Health Issues at Work - All Years",
                                "showactive": True,
                            },
                        ],
                    ),
                    dict(
                        label="Y2014",
                        method="update",
                        args=[
                            {"visible": [True, False, False, False, False]},
                            {
                                "title": "Witnessed Poor Handling of Mental Health Issues at Work in 2014",
                                "showactive": True,
                            },
                        ],
                    ),
                    dict(
                        label="Y2016",
                        method="update",
                        args=[
                            {"visible": [False, True, False, False, False]},
                            {
                                "title": "Witnessed Poor Handling of Mental Health Issues at Work in 2016",
                                "showactive": True,
                            },
                        ],
                    ),
                    dict(
                        label="Y2017",
                        method="update",
                        args=[
                            {"visible": [False, False, True, False, False]},
                            {
                                "title": "Witnessed Poor Handling of Mental Health Issues at Work in 2017",
                                "showactive": True,
                            },
                        ],
                    ),
                    dict(
                        label="Y2018",
                        method="update",
                        args=[
                            {"visible": [False, False, False, True, False]},
                            {
                                "title": "Witnessed Poor Handling of Mental Health Issues at Work in 2018",
                                "showactive": True,
                            },
                        ],
                    ),
                    dict(
                        label="Y2019",
                        method="update",
                        args=[
                            {"visible": [False, False, False, False, True]},
                            {
                                "title": "Witnessed Poor Handling of Mental Health Issues at Work in 2019",
                                "showactive": True,
                            },
                        ],
                    ),
                ]
            ),
        )
    ],
)

fig.update_layout(
    legend=dict(
        title="Handling",
        orientation="h",
        yanchor="bottom",
        y=1.02,
        xanchor="right",
        x=1,
        traceorder="normal",
        itemsizing="constant",
        itemwidth=60,
        bordercolor="#444",
        borderwidth=1,
    )
)

fig.show()

OUTCOMES:

  • Over the survey years, a significant number (ranging from 49.9% to 63.6%) of respondents acknowledged seeking mental health treatment when needed.

  • The vast majority (between 62.8% and 78.5%) of respondents indicated a reluctance to disclose mental health issues during job interviews. Notably, a substantial number (ranging from 17.7% to 29.3%) expressed uncertainty, with only a small percentage (just over 5%) openly stating they would not disclose.

  • Between 21.8% and 32.9% of respondents expressed apprehension that discussing mental health with their superiors could lead to adverse effects. The majority, ranging from 36.5% to 45.1%, responded with ‘Maybe’, indicating uncertainty about the potential consequences.

  • Between 13.6% and 35.6% of respondents reported observing poor handling of mental health issues at work. Conversely, a majority (between 39.3% and 86.4%) held the opposite view, while some respondents either remained unsure or did not disclose their answer to this question.

INSIGHTS:

  • Despite a notable majority (49.9% to 63.6%) seeking mental health treatment when needed, a substantial gap exists in the willingness to openly discuss mental health issues in the workplace.

  • Between 62.8% and 78.5% of respondents are hesitant to disclose mental health issues during job interviews, reflecting a prevalent reluctance in opening up about mental health from the outset.

  • Approximately 32.5% of respondents were unwilling to discuss mental health with their superiors, fearing potential negative repercussions. The remaining participants responded with ‘Maybe’, while a mere average of 4.41% responded with ‘No’. This data underscores the potential obstacles in promoting open discussions about mental well-being in the workplace.

  • Approximately 35.6% of respondents reported witnessing or participating in poor handling of mental health issues at work, indicating a need for improved workplace practices and awareness.

¶

Relationship Analysis
HYPOTHESIS: It is hypothesized that age and gender may significantly affect tech employees’ responses to mental health surveys. It is aimed to reveal how these factors influence attitudes and behaviors towards mental health in the tech industry.

Filtering the database

In [ ]:
filtered_db_path = "./datasets/mental_health_clean_filtered.sqlite"
valid_question_ids = [1, 2, 4, 7, 9, 12, 13, 28, 50, 56, 98, 102]

df_combined = database_filtering(
    processed_dataset, valid_question_ids, filtered_db_path
)
Columns in create_country_subquery: Index(['UserID', 'Country'], dtype='object')
Columns in create_state_subquery: Index(['UserID', 'State'], dtype='object')
Columns in create_treatment_subquery: Index(['UserID', 'Treatment'], dtype='object')
Columns in create_interview_subquery: Index(['UserID', 'Interview'], dtype='object')
In [ ]:
print(f"{df_combined['Age'].unique()}\n")
print(f"{df_combined['Gender'].unique()}\n")
print(f"{df_combined['Treatment'].unique()}\n")
print(f"{df_combined['Interview'].unique()}\n")
print(f"{df_combined['Discussing'].unique()}\n")
print(f"{df_combined['Handling'].unique()}\n")
[37 32 31 33 35 39 42 23 29 36 27 46 34 30 40 38 50 24 18 28 26 22 19 25
 44 45 43 41 21 56 60 54 55 48 20 57 58 47 62 51 49 53 61 52 63 66 59 65
 70 64 67]

['Female' 'Male' 'LGBTQIA+' 'Unknown']

[1 0]

[0.0 1.0 0.5]

[0.0 1.0 0.5 -1.0]

[0.0 1.0 0.5 -1.0]

Age - Treatment Relationship

In [ ]:
point_biserial_corr, p_value = pointbiserialr(
    df_combined["Treatment"], df_combined["Age"]
)

print(
    f"Point-Biserial Correlation: {point_biserial_corr}, p-value: {p_value} for Age vs Treatment ('Seeking Mental Health Treatment in the Past')"
)
Point-Biserial Correlation: 0.06802000977688193, p-value: 1.8625509518119982e-05 for Age vs Treatment ('Seeking Mental Health Treatment in the Past')

Age - Interview Relationship

In [ ]:
spearman_corr, p_value = spearmanr(
    df_combined["Age"], df_combined["Interview"].astype(float)
)

print(
    f"Spearman Rank Correlation: {spearman_corr}, p-value: {p_value} for Age vs Interview ('Bringing Up Mental Health Issue Over Interview')"
)
Spearman Rank Correlation: 0.0013022290674804377, p-value: 0.9347585190895162 for Age vs Interview ('Bringing Up Mental Health Issue Over Interview')

Age - Discussing Relationship

In [ ]:
spearman_corr, p_value = spearmanr(df_combined["Age"], df_combined["Discussing"])

print(
    f"Spearman Rank Correlation: {spearman_corr}, p-value: {p_value} for Age vs Discussing (Fear of Negative Consequences in Mental Health Discussions)"
)
Spearman Rank Correlation: 0.13468394108357834, p-value: 1.8137854566051697e-17 for Age vs Discussing (Fear of Negative Consequences in Mental Health Discussions)

Age - Handling Relationship

In [ ]:
spearman_corr, p_value = spearmanr(df_combined["Age"], df_combined["Handling"])

print(
    f"Spearman Rank Correlation: {spearman_corr}, p-value: {p_value} for Age vs Handling (Witnessed Poor Handling of Mental Health Issues at Work)"
)
Spearman Rank Correlation: 0.11947128064109927, p-value: 4.8058393227034167e-14 for Age vs Handling (Witnessed Poor Handling of Mental Health Issues at Work)

Gender - Treatment Relationship

In [ ]:
male_treatment = df_combined[df_combined["Gender"] == "Male"]["Treatment"]
female_treatment = df_combined[df_combined["Gender"] == "Female"]["Treatment"]
lgbtqia_treatment = df_combined[df_combined["Gender"] == "LGBTQIA+"]["Treatment"]
unknown_treatment = df_combined[df_combined["Gender"] == "Unknown"]["Treatment"]

f_statistic, p_value = f_oneway(
    male_treatment, female_treatment, lgbtqia_treatment, unknown_treatment
)

print(
    f"ANOVA F-statistic: {f_statistic}, p-value: {p_value} for Gender vs Treatment ('Seeking Mental Health Treatment in the Past')"
)
ANOVA F-statistic: 56.47830940124604, p-value: 9.657269285296993e-36 for Gender vs Treatment ('Seeking Mental Health Treatment in the Past')
In [ ]:
tukey_results = pairwise_tukeyhsd(df_combined["Treatment"], df_combined["Gender"])

print(tukey_results)
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
========================================================
 group1   group2  meandiff p-adj   lower   upper  reject
--------------------------------------------------------
  Female LGBTQIA+    0.083 0.3787 -0.0503  0.2164  False
  Female     Male  -0.2173    0.0 -0.2639 -0.1707   True
  Female  Unknown  -0.0545 0.8985 -0.2555  0.1465  False
LGBTQIA+     Male  -0.3003    0.0 -0.4295 -0.1711   True
LGBTQIA+  Unknown  -0.1375 0.4328 -0.3719  0.0969  False
    Male  Unknown   0.1628   0.15 -0.0355  0.3611  False
--------------------------------------------------------
In [ ]:
df_combined["Gender_Code"] = df_combined["Gender"].map(
    {"Male": 1, "Unknown": 2, "Female": 3, "LGBTQIA+": 4}
)
mean_values = df_combined.groupby("Gender_Code")["Treatment"].mean()

x_values = df_combined["Gender_Code"]
y_values = df_combined["Treatment"]

m, b = np.polyfit(x_values, y_values, deg=1)
y_pred = m * x_values + b
r_squared = r2_score(y_values, y_pred)

fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=df_combined["Gender_Code"],
        y=df_combined["Treatment"],
        mode="markers",
        name="Data Points",
        text=df_combined["Gender_Code"],
    )
)

fig.add_trace(
    go.Scatter(
        x=df_combined["Gender_Code"],
        y=m * df_combined["Gender_Code"] + b,
        mode="lines",
        name="Regression Line",
        text=[f"Y = {m:.2f}X + {b:.2f}" for _ in df_combined["Gender_Code"]],
    )
)

fig.update_layout(
    hovermode="closest",
    hoverlabel=dict(bgcolor="white", bordercolor="black"),
    xaxis_title="Gender Code",
    yaxis_title="Treatment",
    title="Scatter Plot with Regression Line",
)

fig.show()

Gender - Interview Relationship

In [ ]:
contingency_table = pd.crosstab(df_combined["Interview"], df_combined["Gender"])
chi2, p, _, _ = chi2_contingency(contingency_table)

n = df_combined.shape[0]

categories_gender = len(df_combined["Gender"].unique())
categories_interview = len(df_combined["Interview"].unique())

cramers_v = np.sqrt(chi2 / (n * (min(categories_gender, categories_interview) - 1)))

print(
    f"Chi-square value: {chi2}, p-value: {p} for Gender vs Interview ('Bringing Up Mental Health Issue Over Interview')"
)
print(f"Cramér's V: {cramers_v}")
Chi-square value: 89.0708561088282, p-value: 4.7248453716916386e-17 for Gender vs Interview ('Bringing Up Mental Health Issue Over Interview')
Cramér's V: 0.10612910040159637

Gender - Discussing Relationship

In [ ]:
observed = pd.crosstab(df_combined["Gender"], df_combined["Discussing"])
chi2, p, _, _ = chi2_contingency(observed)

n = observed.sum().sum()
phi2 = chi2 / n
cramers_v = sqrt(phi2 / min(observed.shape[0] - 1, observed.shape[1] - 1))

print(
    f"Chi-square value: {chi2}, p-value: {p} for Gender vs Discussing (Fear of Negative Consequences in Mental Health Discussions)"
)
print(f"Cramér's V: {cramers_v}")
Chi-square value: 61.436824213018426, p-value: 7.080333712035073e-10 for Gender vs Discussing (Fear of Negative Consequences in Mental Health Discussions)
Cramér's V: 0.07196733474074374

Gender - Handling Relationship

In [ ]:
observed = pd.crosstab(df_combined["Gender"], df_combined["Handling"])
chi2, p, _, _ = chi2_contingency(observed)

n = observed.sum().sum()
phi2 = chi2 / n
cramers_v = sqrt(phi2 / min(observed.shape[0] - 1, observed.shape[1] - 1))

print(
    f"Chi-square value: {chi2}, p-value: {p} for Gender vs Handling (Witnessed Poor Handling of Mental Health Issues at Work)"
)
print(f"Cramér's V: {cramers_v}")
Chi-square value: 100.70542639162835, p-value: 1.1328077252907465e-17 for Gender vs Handling (Witnessed Poor Handling of Mental Health Issues at Work)
Cramér's V: 0.09213984481481263
In [ ]:
cnx.close()
cnx2.close()

OUTCOMES:

  • The Point-Biserial Correlation between 'Age' and 'Seeking Mental Health Treatment in the Past' is 0.068, with a p-value of 1.86e-05, suggesting a weak positive correlation. Although statistically significant, the practical significance is modest.

  • The Spearman Rank Correlation for 'Age' vs 'Interview' is 0.0013 with a p-value of 0.93, indicating no significant correlation. The observed relationship between age and bringing up mental health issues during interviews is likely due to random chance.

  • The Spearman Rank Correlation for 'Age' vs 'Fear of Negative Consequences in Mental Health Discussions' is 0.1347, with a p-value of 1.81e-17, suggesting a significant weak positive correlation. Younger individuals may be more inclined to perceive discussing mental health negatively.

  • The Spearman Rank Correlation for 'Age' vs 'Witnessed poor handling of mental health issues at work' is 0.1195, with a p-value of 4.81e-14, indicating a significant weak positive correlation. Younger individuals may have a slightly higher tendency to witness poor handling of mental health issues at work.

  • The correlation between 'Gender' and 'Seeking Mental Health Treatment in the Past' is significant (correlation coefficient = 19.38, p-value < 0.05), indicating that treatment outcomes vary significantly across different genders.

  • The Chi-square test shows a significant association between 'Gender' and 'Bringing Up Mental Health Issue Over Interview' (Chi-square = 21.98, p = 0.0002), rejecting the null hypothesis of independence.

  • The Chi-square test shows a significant association between 'Gender' and 'Fear of Negative Consequences in Mental Health Discussions' (Chi-square = 13.83, p = 0.03), rejecting the null hypothesis of independence.

  • The Chi-square test shows a significant association between 'Gender' and 'Witnessed poor handling of mental health issues at work' (Chi-square = 13.83, p = 0.03), rejecting the null hypothesis of independence.

INSIGHTS:

  • The Point-Biserial Correlation indicates a statistically significant weak positive correlation between 'Age' and the likelihood of seeking mental health treatment in the past. Although the relationship is statistically significant, the practical significance is modest.

  • The Spearman Rank Correlation for 'Age' vs 'Interview' suggests no significant correlation. The observed relationship between age and bringing up mental health issues during interviews is likely due to random chance.

  • The Spearman Rank Correlation for 'Age' vs 'Fear of Negative Consequences in Mental Health Discussions' indicates a statistically significant weak positive correlation. Younger individuals may be more inclined to perceive discussing mental health negatively.

  • The Spearman Rank Correlation for 'Age' vs 'Witnessed poor handling of mental health issues at work' suggests a statistically significant weak positive correlation. Younger individuals may have a slightly higher tendency to witness poor handling of mental health issues at work.

  • Significant differences exist in the 'Treatment' means between Male and Female groups, based on the ANOVA and post-hoc Tukey HSD test. The regression analysis suggests a positive association between 'Gender_Code' and the likelihood of seeking mental health treatment in the past. However, the model's low R2 indicates limited explanatory power, and caution is advised in drawing strong conclusions from this analysis.

  • The chi-square test reveals a statistically significant association between 'Gender' and 'Bringing Up Mental Health Issue Over Interview.' However, Cramér's V (0.1153) indicates a relatively weak practical relationship. While evidence of a connection exists, the effect size is modest, suggesting other factors may also influence participants' responses.

  • The chi-square test suggests a significant association between 'Gender' and the likelihood of perceiving discussing mental health with a boss as having negative consequences. Cramér's V (0.0915) indicates a small to modest effect size, suggesting a weak association between gender and the perception of discussing mental health.

  • There's a significant association between gender and the likelihood of witnessing poor handling of mental health issues at work. Cramér's V (0.0915) reflects a small to modest effect size, indicating a weak association between gender and the perception of poor handling of mental health issues.


¶

CONCLUSIONS

¶

SUMMARY:

  1. Distribution of Respondent Characteristics

    • Predominant age groups include 25-34 (42.2% to 50%) and 35-44 (30.84% to 32.29%), reflecting a concentration in mid-career demographics.
    • Men consistently comprise 64.2% to 79%, while women range from 19.68% to 30%. Smaller groups include those withholding gender information or identifying as LGBTQIA+.
    • The survey attracted 2958 respondents globally, led by the United States (63.29%). Diverse participation from the United Kingdom, Canada, Germany, and others contributed to the overall representation. US State Participation:
    • California leads with 17.29% of total respondents, followed by Illinois, Washington, and other states, providing valuable regional perspectives in the tech industry.
  2. Treatment Variation Across Genders:

    • Significant differences in 'Treatment' means were found between Male and Female group, based on ANOVA and post-hoc Tukey HSD test.
    • Regression analysis suggests a positive association between 'Gender_Code' and seeking mental health treatment. Females tend to have higher expected values for seeking treatment than Male.
    • The model's low R² indicates limited explanatory power, cautioning against strong conclusions.
  3. Impact of Age on Perceptions:

    • Younger individuals may be more inclined to perceive discussing mental health with their boss as having negative consequences.
    • Younger individuals may have a slightly higher tendency to witness poor handling of mental health issues at work.
  4. Gender's Influence on Mental Health Discussions:

    • Significant associations exist between 'Gender' and responses related to mental health discussions during interviews, both in seeking treatment and discussing potential negative consequences.

¶

INSIGHTS:

  1. The correlation analysis revealed no significant correlation between 'Age' and 'Seeking Mental Health Treatment in the Past' or 'Bringing Up Mental Health Issue Over Interview.' The observed relationships were weak, and the p-values exceeded the 0.05 significance level.

  2. A statistically significant weak positive correlation exists between 'Age' and the likelihood of witnessing poor handling of mental health issues at work. Younger individuals may have a slightly higher tendency to witness poor handling of mental health issues at work.

  3. A statistically significant weak negative correlation exists between 'Age' and the perceptions of discussing mental health with a boss. Younger individuals may be more inclined to perceive discussing mental health negatively.

  4. Significant differences were found in the 'Treatment' means between Male and Female groups, based on the ANOVA and post-hoc Tukey HSD test. The regression analysis suggests a positive association between 'Gender_Code' and the likelihood of seeking mental health treatment in the past. However, the model's low R2 indicates limited explanatory power, and caution is advised in drawing strong conclusions from this analysis.

  5. The chi-square test revealed a statistically significant association between 'Gender' and 'Bringing Up Mental Health Issue Over Interview.' However, Cramér's V (0.1153) indicates a relatively weak practical relationship. While evidence of a connection exists, the effect size is modest, suggesting other factors may also influence participants' responses.

  6. The chi-square test suggests a significant association between 'Gender' and the likelihood of perceiving discussing mental health with a boss as having negative consequences. Cramér's V (0.0915) indicates a small to modest effect size, suggesting a weak association between gender and the perception of discussing mental health.

  7. There's a significant association between gender and the likelihood of witnessing poor handling of mental health issues at work. Cramér's V (0.0915) reflects a small to modest effect size, indicating a weak association between gender and the perception of poor handling of mental health issues.

¶

POTENTIAL AREAS FOR INVESTIGATION:

Observations Relevant to Clients:

  1. Explore factors influencing the observed differences in treatment-seeking behavior among genders.
  2. Investigate strategies to promote mental health discussions at the workplace, particularly among younger individuals.
  3. Consider focusing on specific target groups rather than all tech workers, as there are disproportions between respondents' characteristics (Country, State, Gender, Age, etc.) which may lead to potential biases in results.

Observations Relevant to Analysts (processing would require more data):

  1. Examine additional variables that may contribute to the observed variations in treatment-seeking behavior.
  2. Further investigate the impact of workplace culture on mental health perceptions.

¶

RECOMMENDATIONS:

  1. Develop targeted interventions to encourage open mental health discussions, addressing potential concerns among different age groups.
  2. Implement workplace initiatives to create a supportive environment for mental health discussions and destigmatize seeking treatment.
  3. Consider collecting additional data on workplace-specific factors to enhance the understanding of the observed correlations.